feat: add RoboSpatial task by njb-nvidia · Pull Request #1347 · EvolvingLMMs-Lab/lmms-eval

njb-nvidia · 2026-05-20T22:46:01Z

Summary

Adds RoboSpatial, a spatial-reasoning benchmark for robotic manipulation scenes (RoboSpatial-Home) covering three sub-categories:

compatibility — 105 items
configuration — 123 items
context — 122 items

Total: 350 items.

This port exposes:

`robo_spatial` (group)
`robo_spatial_all` (union of all three splits via `dataset_kwargs.data_files` with `verification_mode: no_checks`)
`robo_spatial_compatibility` / `robo_spatial_configuration` / `robo_spatial_context` (single-category sub-tasks via `_default_template.yaml`)
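The union task relies on how HuggingFace `datasets` handles `data_files`: mapping one split name to several files concatenates them into a single split, and `verification_mode="no_checks"` skips verification against the repository's recorded split metadata. A minimal sketch of the equivalent keyword arguments follows. The parquet glob patterns and the helper names `build_union_kwargs` and `load_union` are hypothetical; the actual file layout of the dataset repository is not shown in this PR.

```python
CATEGORIES = ("compatibility", "configuration", "context")


def build_union_kwargs(categories=CATEGORIES, split="test"):
    """Build load_dataset kwargs that merge per-category files into one split."""
    # Hypothetical paths: the real file names in the dataset repo may differ.
    files = [f"data/{name}-*.parquet" for name in categories]
    return {
        "data_files": {split: files},      # one split name, several files
        "verification_mode": "no_checks",  # skip split/checksum verification
    }


def load_union(path="chanhee-luke/RoboSpatial-Home"):
    """Load the concatenated split (requires network access and `datasets`)."""
    # Imported lazily so build_union_kwargs works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(path, split="test", **build_union_kwargs())


kwargs = build_union_kwargs()
print(kwargs["data_files"]["test"])
```

In the YAML task config, the same two keys would sit under `dataset_kwargs`.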

Metric: `robo_spatial_score` — task-specific scoring (point / region / affordance correctness; see `pre_process.py` for parsing).
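As a rough illustration of what point-style scoring can look like, the sketch below extracts the first `(x, y)` pair from a model response and checks whether it lands inside a rectangular target region. Everything here is an assumption for illustration: `parse_point`, `point_in_box`, `score_response`, and the normalized-coordinate convention are hypothetical and are not taken from `pre_process.py` or `utils.py`, whose actual parsing and scoring rules may differ.

```python
import re
from typing import Optional, Tuple

# Matches "(x, y)" where x and y are plain decimal numbers.
_POINT_RE = re.compile(r"\(\s*(-?\d*\.?\d+)\s*,\s*(-?\d*\.?\d+)\s*\)")


def parse_point(text: str) -> Optional[Tuple[float, float]]:
    """Return the first (x, y) pair found in the text, or None."""
    match = _POINT_RE.search(text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def point_in_box(point: Tuple[float, float],
                 box: Tuple[float, float, float, float]) -> bool:
    """Check whether a point lies inside an (x_min, y_min, x_max, y_max) box."""
    x, y = point
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max


def score_response(text: str, box: Tuple[float, float, float, float]) -> float:
    """Score 1.0 if the parsed point falls in the box, else 0.0."""
    point = parse_point(text)
    if point is None:
        return 0.0  # unparseable answers count as incorrect in this sketch
    return 1.0 if point_in_box(point, box) else 0.0
```

A real implementation would likely test against a segmentation mask rather than a box; the box keeps the sketch dependency-free.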

Files

`lmms_eval/tasks/robo_spatial/_default_template.yaml` — shared task config.
`lmms_eval/tasks/robo_spatial/robo_spatial.yaml` — group definition.
`lmms_eval/tasks/robo_spatial/robo_spatial_all.yaml` — concatenated test split.
`lmms_eval/tasks/robo_spatial/robo_spatial_{compatibility,configuration,context}.yaml` — per-category tasks.
`lmms_eval/tasks/robo_spatial/utils.py` — doc transforms, scoring, aggregation.
`lmms_eval/tasks/robo_spatial/pre_process.py` — answer parsing helpers.

Parity vs. local fork

Qwen3-VL-2B-Instruct, full test split on 8x H100, greedy decoding.

| Source | Compatibility | Configuration | Context | Overall (350 items) |
|---|---|---|---|---|
| Fork | 0.610 | 0.675 | 0.320 | 0.5314 |
| Upstream | 0.629 | 0.732 | 0.320 | 0.5571 |
Per-doc analysis on the 309 questions shared between the two runs (matched by `doc_id`): 91.9% (284 of 309) receive identical scores.
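The per-doc comparison amounts to joining the two runs on `doc_id` and counting how many shared docs received the same score. A minimal sketch, assuming each run is available as a mapping from `doc_id` to score; the dict layout and the name `agreement_rate` are illustrative and not the format lmms-eval writes to disk.

```python
def agreement_rate(fork_scores, upstream_scores, tol=1e-9):
    """Return (number of shared doc_ids, fraction with identical scores)."""
    shared = sorted(set(fork_scores) & set(upstream_scores))
    if not shared:
        raise ValueError("the two runs share no doc_ids")
    same = sum(
        1 for doc_id in shared
        if abs(fork_scores[doc_id] - upstream_scores[doc_id]) <= tol
    )
    return len(shared), same / len(shared)


# Toy data: doc_ids 1, 2, 3 are shared; docs 1 and 3 agree, doc 2 does not.
fork = {0: 1.0, 1: 0.0, 2: 1.0, 3: 1.0}
upstream = {1: 0.0, 2: 0.0, 3: 1.0, 4: 1.0}

n_shared, rate = agreement_rate(fork, upstream)
print(n_shared, round(rate, 3))
```

Docs present in only one run are excluded from the denominator, which is why the comparison covers fewer questions than the full split.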

Delta (+2.6pp overall) is consistent with the qwen3_vl model-class drift we have observed on other ports (e.g. metavqa, egoplan2).
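The overall figures can be cross-checked from the per-category scores and item counts. The sketch below assumes the overall score is the item-weighted mean of the three categories, which the PR does not state explicitly but which matches the reported numbers to within the rounding of the per-category values.

```python
COUNTS = {"compatibility": 105, "configuration": 123, "context": 122}
FORK = {"compatibility": 0.610, "configuration": 0.675, "context": 0.320}
UPSTREAM = {"compatibility": 0.629, "configuration": 0.732, "context": 0.320}


def overall(scores, counts=COUNTS):
    """Item-weighted mean of per-category scores."""
    total = sum(counts.values())
    return sum(scores[name] * counts[name] for name in counts) / total


fork_overall = overall(FORK)
upstream_overall = overall(UPSTREAM)
delta_pp = (upstream_overall - fork_overall) * 100  # percentage points

print(f"fork={fork_overall:.4f} upstream={upstream_overall:.4f} "
      f"delta={delta_pp:+.1f}pp")
```

The recomputed values (about 0.5318 and 0.5575) differ from the reported 0.5314 and 0.5571 only in the fourth decimal place, because the inputs are already rounded to three decimals; the delta still rounds to +2.6pp.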

Test plan

`uv run lmms-eval --tasks robo_spatial_all --limit 8` smoke test
Full run on 8x H100 with Qwen3-VL-2B-Instruct; per-category scores are within 0.06 of the fork (largest gap: configuration, 0.675 vs. 0.732)
`combined` split assembly via `dataset_kwargs.data_files` + `verification_mode: no_checks` verified end-to-end (350 docs loaded as expected)

Dataset: chanhee-luke/RoboSpatial-Home on HuggingFace.

njb-nvidia wants to merge 1 commit into EvolvingLMMs-Lab:main from njb-nvidia:add-robo_spatial-task
